N-RPN: Hard Example Learning for Region Proposal Networks
The region proposal task is to generate a set of candidate regions that
contain an object. In this task, it is most important to cover as many
ground-truth objects as possible within a fixed number of proposals. In a
typical image, however, there are too few hard negative examples compared to
the vast number of easy negatives, so region proposal networks struggle to
train on hard negatives. Because of this problem, networks tend to propose hard
negatives as candidates, while failing to propose ground-truth candidates,
which leads to poor performance. In this paper, we propose a Negative Region
Proposal Network (nRPN) to improve the Region Proposal Network (RPN). The nRPN
learns from the RPN's false positives and provides hard negative examples to
the RPN.
Our proposed nRPN leads to a reduction in false positives and better RPN
performance. An RPN trained with an nRPN achieves performance improvements on
the PASCAL VOC 2007 dataset.
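The core idea of emphasizing hard negatives can be illustrated with a minimal sketch. This is not the nRPN itself (which learns to generate such examples); it only shows the underlying selection principle of treating high-scoring negative proposals as hard examples. The data and function name are hypothetical.

```python
# Hypothetical sketch of hard negative selection for region proposals.
# Negatives the model scores highly are confident false positives, i.e.,
# the hard examples worth emphasizing during training.

def select_hard_negatives(proposals, k):
    """Return the k negative proposals with the highest objectness score.

    proposals: list of (score, is_positive) tuples.
    """
    negatives = [p for p in proposals if not p[1]]
    # Sort negatives by descending score: a high-scoring negative is a
    # confident mistake and thus a hard example.
    negatives.sort(key=lambda p: p[0], reverse=True)
    return negatives[:k]

proposals = [(0.9, True), (0.8, False), (0.3, False), (0.7, False), (0.1, False)]
print(select_hard_negatives(proposals, 2))  # → [(0.8, False), (0.7, False)]
```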
Domain Alignment and Temporal Aggregation for Unsupervised Video Object Segmentation
Unsupervised video object segmentation aims at detecting and segmenting the
most salient object in videos. In recent times, two-stream approaches that
collaboratively leverage appearance cues and motion cues have attracted
extensive attention thanks to their powerful performance. However, there are
two limitations faced by those methods: 1) the domain gap between appearance
and motion information is not well considered; and 2) long-term temporal
coherence within a video sequence is not exploited. To overcome these
limitations, we propose a domain alignment module (DAM) and a temporal
aggregation module (TAM). DAM resolves the domain gap between two modalities by
forcing the values to be in the same range using a cross-correlation mechanism.
TAM captures long-term coherence by extracting and leveraging global cues of a
video. On public benchmark datasets, our proposed approach demonstrates its
effectiveness, outperforming all existing methods by a substantial margin.
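The range-alignment idea behind DAM can be sketched in isolation. The paper's module uses a cross-correlation mechanism; the toy function below only illustrates the simpler underlying notion of forcing two modalities into the same value range (here via min-max normalization, an assumption for illustration).

```python
def normalize_to_unit_range(features):
    """Rescale a 1-D feature list into [0, 1] so that two modalities
    (e.g., appearance and motion) become directly comparable."""
    lo, hi = min(features), max(features)
    if hi == lo:
        return [0.0 for _ in features]
    return [(f - lo) / (hi - lo) for f in features]

# Appearance and motion features often live in very different ranges.
appearance = [10.0, 30.0, 20.0]
motion = [0.1, 0.9, 0.5]
print(normalize_to_unit_range(appearance))  # → [0.0, 1.0, 0.5]
print(normalize_to_unit_range(motion))      # → [0.0, 1.0, 0.5]
```

After rescaling, the two modalities occupy the same range, which is the precondition DAM establishes before fusing them.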
Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition
Skeleton-based action recognition has attracted considerable attention due to
the compact skeletal representation of the human body. Many recent methods have
achieved remarkable performance using graph convolutional networks (GCNs) and
convolutional neural networks (CNNs), which extract spatial and temporal
features, respectively. Although spatial and temporal dependencies in the human
skeleton have been explored, spatio-temporal dependency is rarely considered.
In this paper, we propose the Inter-Frame Curve Network (IFC-Net) to
effectively leverage the spatio-temporal dependency of the human skeleton. Our
proposed network consists of two novel elements: 1) The Inter-Frame Curve (IFC)
module; and 2) Dilated Graph Convolution (D-GC). The IFC module increases the
spatio-temporal receptive field by identifying meaningful node connections
between adjacent frames and generating spatio-temporal curves based on the
identified node connections. The D-GC allows the network to have a large
spatial receptive field, which specifically focuses on the spatial domain. The
kernels of D-GC are computed from the given adjacency matrices of the graph and
reflect a large receptive field in a manner similar to dilated CNNs. Our IFC-Net
combines these two modules and achieves state-of-the-art performance on three
skeleton-based action recognition benchmarks: NTU-RGB+D 60, NTU-RGB+D 120, and
Northwestern-UCLA.
Comment: 12 pages, 5 figures
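The receptive-field intuition behind D-GC can be sketched with plain adjacency algebra. This is not the paper's kernel construction, only an illustration of the general fact it builds on: powers of the adjacency matrix connect nodes to multi-hop neighbors, analogously to how dilation widens a CNN kernel.

```python
# Minimal sketch: squaring an adjacency matrix reaches 2-hop neighbors,
# enlarging the graph receptive field without adding direct edges.

def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Path graph 0-1-2-3: each node is linked only to its immediate neighbors.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]

adj2 = matmul(adj, adj)
# No direct edge 0-2 exists, but adj^2 reaches node 2 from node 0 in
# two hops, mimicking a dilated kernel on the graph.
print(adj[0][2], adj2[0][2])  # → 0 1
```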
Global-Local Aggregation with Deformable Point Sampling for Camouflaged Object Detection
The camouflaged object detection (COD) task aims to find and segment objects
that have a color or texture that is very similar to that of the background.
Despite the difficulties of the task, COD is attracting attention in medical,
lifesaving, and anti-military fields. To overcome the difficulties of COD, we
propose a novel global-local aggregation architecture with a deformable point
sampling method. Further, we propose a global-local aggregation transformer
that integrates the object's global information with local information from the
background and boundary, which is important in COD tasks. The proposed
transformer obtains
global information from feature channels and effectively extracts important
local information from the subdivided patch using the deformable point sampling
method. Accordingly, the model effectively integrates global and local
information for camouflaged objects and demonstrates that important boundary
information in COD can be efficiently utilized. Our method is evaluated on
three popular datasets and achieves state-of-the-art performance. We demonstrate
the effectiveness of the proposed method through comparative experiments.
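Sampling features at offset locations, the mechanism underlying deformable point sampling, can be sketched minimally. The offsets below are fixed for illustration; in the paper's method they would be predicted by the network, and real implementations typically use bilinear rather than nearest-neighbor interpolation.

```python
def sample_points(feature_map, base, offsets):
    """Gather feature values at base + offset locations
    (nearest-neighbor lookup, clamped to the map border)."""
    h, w = len(feature_map), len(feature_map[0])
    samples = []
    for dy, dx in offsets:
        y = min(max(base[0] + dy, 0), h - 1)
        x = min(max(base[1] + dx, 0), w - 1)
        samples.append(feature_map[y][x])
    return samples

fmap = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
# Hypothetical offsets; a deformable module would learn these so that
# sampling concentrates on informative (e.g., boundary) locations.
print(sample_points(fmap, (1, 1), [(-1, 0), (0, 1), (2, 2)]))  # → [2, 6, 9]
```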
Occluded Person Re-Identification via Relational Adaptive Feature Correction Learning
Occluded person re-identification (Re-ID) in images captured by multiple
cameras is challenging because the target person is occluded by pedestrians or
objects, especially in crowded scenes. In addition to the processes performed
during holistic person Re-ID, occluded person Re-ID involves the removal of
obstacles and the detection of partially visible body parts. Most existing
methods utilize off-the-shelf pose or parsing networks to generate pseudo labels,
which are prone to error. To address these issues, we propose a novel Occlusion
Correction Network (OCNet) that corrects features through relational-weight
learning and obtains diverse and representative features without using external
networks. In addition, we present a simple concept of a center feature in order
to provide an intuitive solution to pedestrian occlusion scenarios.
Furthermore, we introduce a Separation Loss (SL) that encourages part features
to capture regions distinct from the global features. We conduct extensive
experiments on five challenging benchmark datasets for occluded and holistic
Re-ID tasks to demonstrate that our method achieves superior performance to
state-of-the-art methods, especially on occluded scenes.
Comment: ICASSP 202
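A separation-style objective can be sketched abstractly. The paper's exact SL formulation is not given here; the toy loss below merely illustrates the general idea of penalizing similarity between a global feature and its part features, using cosine similarity as an assumed measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def separation_loss(global_feat, part_feats):
    """Illustrative loss: penalize each part feature for resembling the
    global feature, pushing parts toward complementary information."""
    return sum(max(0.0, cosine_similarity(global_feat, p))
               for p in part_feats) / len(part_feats)

g = [1.0, 0.0]
parts = [[0.0, 1.0], [1.0, 0.0]]  # one orthogonal part, one redundant part
print(separation_loss(g, parts))  # → 0.5
```

Minimizing such a loss drives part features away from the global feature's direction, so each part contributes distinct information.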
Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation
Unsupervised video object segmentation (VOS) is a task that aims to detect
the most salient object in a video without external guidance about the object.
To leverage the property that salient objects usually have distinctive
movements compared to the background, recent methods collaboratively use motion
cues extracted from optical flow maps with appearance cues extracted from RGB
images. However, as optical flow maps are often closely correlated with
segmentation masks, the network easily becomes overly dependent on motion cues
during training. As a result, such two-stream approaches are vulnerable
to confusing motion cues, making their predictions unstable. To alleviate this
issue, we design a novel motion-as-option network by treating motion cues as
optional. During network training, RGB images are randomly provided to the
motion encoder instead of optical flow maps, to implicitly reduce motion
dependency of the network. As the learned motion encoder can deal with both RGB
images and optical flow maps, two different predictions can be generated
depending on which source information is used as motion input. In order to
fully exploit this property, we also propose an adaptive output selection
algorithm to adopt the optimal prediction at test time. Our proposed approach
achieves state-of-the-art performance on all public benchmark datasets while
maintaining real-time inference speed.
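Choosing between the two predictions at test time can be sketched with a confidence criterion. The paper's actual selection rule is not specified here; the sketch assumes a common heuristic of preferring the prediction with lower mean per-pixel entropy.

```python
import math

def entropy(probs):
    """Shannon entropy of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_output(pred_with_flow, pred_with_rgb):
    """Pick the prediction whose per-pixel class probabilities are more
    confident, i.e., have lower mean entropy (an assumed criterion)."""
    e_flow = sum(entropy(p) for p in pred_with_flow) / len(pred_with_flow)
    e_rgb = sum(entropy(p) for p in pred_with_rgb) / len(pred_with_rgb)
    return "flow" if e_flow <= e_rgb else "rgb"

# Each prediction: per-pixel foreground/background probabilities.
confident = [[0.95, 0.05], [0.9, 0.1]]
uncertain = [[0.6, 0.4], [0.55, 0.45]]
print(select_output(confident, uncertain))  # → flow
```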